The aim of this project is to build a machine-learning model that can predict the outcome of a WTA match in a Grand Slam tournament. To accomplish this, we are using the tennis_wta data set by Jeff Sackmann from GitHub and will be implementing a variety of techniques from the class to develop the most accurate model for this binary classification problem.
library(corrplot)     # correlation plots
library(discrim)      # linear discriminant analysis
library(corrr)        # calculating correlation
library(knitr)        # assists with the markdown
library(MASS)         # lda()/qda() engines
library(tidyverse)    # tidyverse
library(tidymodels)   # tidymodels
library(ggplot2)      # visualizations
library(ggrepel)
library(ggimage)
library(rpart.plot)   # visualizing trees
library(vip)          # variable importance
library(vembedr)      # embedding links
library(janitor)      # cleaning our data
library(randomForest) # building our random forest
library(dplyr)        # data manipulation
library(yardstick)    # measuring certain metrics
library(ISLR)
library(xgboost)
library(ggtext)
setwd("~/Downloads/PSTAT231/Final_Project")
knitr::opts_chunk$set(echo = TRUE, message = FALSE,
warning = FALSE)
The WTA (Women’s Tennis Association) presides over the WTA Tour, the worldwide professional tennis tour for women, and is the organizing body of women’s professional tennis. The biggest tournaments exist for both the WTA and the ATP, and the most important of all are the Grand Slam majors: the Australian Open, Roland Garros, Wimbledon, and the US Open. These tournaments happen every year, and winning a Grand Slam is a crowning achievement in a tennis career. They offer the most prize money, public and media attention, and ranking points, along with the deepest pool of competitors.
Grand Slams last for 2 weeks and consist of 7 rounds of single-elimination matches, with 128 players in the first round. The Grand Slam tournaments are played on different surfaces: the Australian and US Opens on hard courts, Roland Garros on clay, and Wimbledon on grass. Since Grand Slams are the highest-profile and most important tournaments in tennis, their outcomes can be unpredictable to a certain extent, which provides exciting opportunities for tennis fans to watch a player make a breakthrough.
Here is a video of 2021 US Open women’s champion Emma Raducanu. This was the most memorable and unpredictable Grand Slam tournament in recent years because of Raducanu’s incredible circumstances. Raducanu entered the main draw as a qualifier which meant that she played three matches in a qualifying tournament in order to get into the main draw of the tournament. She won matches against notable and highly-ranked players without ever dropping a set. She went on to win the entire tournament without dropping a set and was the first woman to do this at the US Open since 2014 (accomplished by Serena Williams). She also is the first qualifier, female or male, to win a Grand Slam in the Open Era.
Due to the analytics revolution within sports, tennis, like many other sports, is using data and artificial intelligence to predict match outcomes. In tennis specifically, IBM Consulting has partnered with the Grand Slams to turn data into captivating match insights for fans. IBM’s analytics and other predictive models can help fans decide whether they want to watch a particular match, and much more. IBM’s predictive model is incredibly robust, drawing on millions of data points from play-by-play data, player information, and sentiment analysis of articles. With models like this, organizations and tournaments can attract more customers and higher profits while also engaging fans with interesting statistics.
Given the project background and importance, I’ll go over the general process of creating a predictive model for tennis match outcomes. We will first perform some data manipulation and cleaning, and then perform some exploratory data analysis to get a better understanding of the data and any trends within it as well. The goal is to determine tennis match outcomes (win or loss) given match predictor variables like player information and match statistics. Next, we will perform a split on our data, develop a recipe, and set folds for cross-validation. We will be implementing a variety of models that we learned from this class: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Lasso, Decision Tree, Random Forest, and K-Nearest Neighbor models. We will then pick the best model from these methods and analyze its accuracy and effectiveness.
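As a rough illustration, the split/recipe/folds steps described above might look like the following tidymodels sketch; the toy data frame and the normalization step are placeholders, not the project's actual data or recipe.

```r
library(tidymodels)

# Toy stand-in for the cleaned WTA data: `win` is a two-level factor outcome
set.seed(231)
toy_df <- tibble(
  win = factor(rep(c(0, 1), each = 50)),
  x1  = rnorm(100),
  x2  = rnorm(100)
)

# Stratified train/test split, cross-validation folds, and a recipe
wta_split  <- initial_split(toy_df, prop = 0.75, strata = win)
wta_train  <- training(wta_split)
wta_folds  <- vfold_cv(wta_train, v = 5, strata = win)
wta_recipe <- recipe(win ~ ., data = wta_train) %>%
  step_normalize(all_numeric_predictors())
```

Stratifying on `win` keeps the class balance the same in the training set, the test set, and each fold.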
We obtained the data for this project from Jeff Sackmann’s tennis_wta repository on GitHub. His open-source datasets contain professional WTA match statistics and information going back to 1968. The data set includes information on the match (first-serve points won, number of aces, double faults, etc.), player information (height, age, ranking, etc.), and tournament information. In total there are 48 features for every match observation, with approximately 3,000 records per year across more than 52 years of tour-level singles matches. Since so much data is available, I limited the scope of my analysis to the years 2019 to 2022 and to Grand Slam tournaments only.
setwd("~/Downloads/PSTAT231/Final_Project") # setting to final proj directory
wta_2022_df <-read.csv('data/wta_matches_2022.csv') # 2022 data
wta_2021_df <-read.csv('data/wta_matches_2021.csv') # 2021 data
wta_2020_df <-read.csv('data/wta_matches_2020.csv') # 2020 data
wta_2019_df <-read.csv('data/wta_matches_2019.csv') # 2019 data
wta_df<- rbind(wta_2022_df, wta_2021_df, wta_2020_df, wta_2019_df)
dim(wta_df) # How big is the data set without filtering?
## [1] 8817 49
# only interested in predicting Grand Slams
wta_df <- wta_df %>% filter( tourney_level == 'G')
dim(wta_df)
## [1] 1905 49
There are 49 variables in this data set and 1,905 Grand Slam match records. This is quite a lot of information, and we need to do some cleaning before we have a data set we are ready to work with. Let’s look at the missingness within the data to understand which rows and variables we need to drop.
# how many missing values for each column
table_missing<- wta_df %>% summarise_all(~sum(is.na(.))) # undercounts: blank entries are read as "" rather than NA
# missing values for ranking info columns that need to be factors
x<- sum(is.na(as.numeric(wta_df$winner_seed)))/nrow(wta_df) # about 50 percent is missing
y<- sum(is.na(as.numeric(wta_df$loser_seed)))/nrow(wta_df) # about 75 percent is missing
z<- sum(is.na(as.numeric(wta_df$winner_entry)))/nrow(wta_df) # everything is missing
t <- sum(is.na(as.numeric(wta_df$loser_entry)))/nrow(wta_df) # everything is missing
missing_ness_num<- list(x, y, z, t)
names(missing_ness_num)<- c("Winner Seed % Missing", "Loser Seed % Missing",
"Winner Entry % Missing", "Loser Entry % Missing")
missing_ness_num
## $`Winner Seed % Missing`
## [1] 0.4850394
##
## $`Loser Seed % Missing`
## [1] 0.7569554
##
## $`Winner Entry % Missing`
## [1] 1
##
## $`Loser Entry % Missing`
## [1] 1
# correlation values for these values
winner_seedc<- as.numeric(wta_df$winner_seed)
loser_seedc <- as.numeric(wta_df$loser_seed)
df_ranking_info<- data.frame(winner_seedc, loser_seedc, wta_df$winner_rank,
wta_df$loser_rank)
M<- cor(na.omit(df_ranking_info))
corrplot(M, type = 'lower')
After inspecting the level of missingness in the data, we found that the variables winner_seed, loser_seed, winner_entry, and loser_entry had the highest levels of missingness in the data set. As for match statistics, 135 rows had missing values, which I suspect come from rows missing all of their match statistics; we will deal with these later. Since winner_entry and loser_entry were 100 percent missing, we eliminate those variables. A player's seed in a tournament also reflects her current ranking: the correlation plot shows that winner_seed and loser_seed are highly correlated with ranking, so I found no reason to keep them either. From this inspection, I remove the variables winner_seed, loser_seed, winner_entry, and loser_entry from the data set.
# R redundant information based on Grand Slam data:
# tourney_id, tourney_date, draw_size, tourney_level, best_of, match_num
wta_df <- wta_df %>% dplyr::select(- c(tourney_id, tourney_date,
draw_size, tourney_level,
best_of, match_num, winner_seed,
loser_seed, winner_entry,
loser_entry))
#correlation plot to determine extra info
wta_df_numeric<- wta_df %>% dplyr::select(where(is.numeric))
M = cor(na.omit(wta_df_numeric))
corrplot(M, type = 'lower')
# Observations: significant relationships between id and age, and between rank and rank points
# a player's id reflects personal information about that player
wta_df <- wta_df %>% dplyr::select(- c(winner_rank_points, loser_rank_points))
Next, I filtered out some redundant information from the data set based on intuition and a background in playing tennis. Since we are only concerned with Grand Slam matches and already have the match and tournament setup, I removed the variables tourney_id, tourney_date, draw_size, tourney_level, best_of, and match_num, as they provide redundant information. Looking at the correlation plot, I identified strong relationships between id and age, and between rank and rank points. This makes intuitive sense: a player's id is a number associated with a specific player and so reflects personal information about her, and the number of ranking points a player has is exactly what determines her ranking, so we can filter out this redundant information as well. We are now left with 37 variables and 212 unique players in our data set.
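Redundant pairs like these can also be surfaced programmatically with the corrr package loaded earlier; the toy columns below are illustrative, with rank_points constructed to be nearly collinear with rank.

```r
library(corrr)
library(dplyr)

# Toy numeric columns: rank_points is almost perfectly (negatively)
# correlated with rank, mimicking the redundancy found above
set.seed(1)
df <- tibble(
  rank        = rnorm(100),
  rank_points = -0.9 * rank + rnorm(100, sd = 0.1),
  age         = rnorm(100)
)

# List variable pairs whose absolute correlation exceeds a threshold
high_pairs <- correlate(df, quiet = TRUE) %>%
  stretch(remove.dups = TRUE) %>%
  filter(!is.na(r), abs(r) > 0.8)
high_pairs  # the rank / rank_points pair
```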
# Missingness for player information + Creating lookup table for height
total_ht_missing<- subset(wta_df, is.na(winner_ht) | is.na(loser_ht) )
winnner_ht_missing <-subset(wta_df, is.na(winner_ht))
loser_ht_missing<-subset(wta_df, is.na(loser_ht))
wplayer_ht_missing<- winnner_ht_missing$winner_name
lplayer_ht_missing<- loser_ht_missing$loser_name
total_height_missing <- unique(c(wplayer_ht_missing, lplayer_ht_missing))
heights_vect<- c(170, 173, 178, 182, 170, NA, 168, 170, 170, 175, 169, 182,
165, 175, 164, 164, 164, 166, 178, 172, 175, 173, 170, 180,
170, 164, 180, 170, 174, 180, 160, NA, NA, NA, 183, 170,
173, NA, 171, NA, 177, 179, 170, NA, 172, NA, NA, NA, 175, 173,
NA, NA, NA, 157, 172, 173, 157, NA, 170, 165, NA, 183, NA)
height_lookup<- data.frame(name = total_height_missing, height = heights_vect)
save(height_lookup, file = "data/lookup_height.RData")
val<- which( is.na(wta_df$winner_ht) | is.na(wta_df$loser_ht))
for (i in val){
if (is.na(wta_df[i, 'winner_ht'])){
winner_name<- wta_df[i, 'winner_name']
wta_df[i, 'winner_ht'] = height_lookup[height_lookup$name == winner_name, ]$height
}
if (is.na(wta_df[i, 'loser_ht'])){
loser_name<- wta_df[i, 'loser_name']
wta_df[i, 'loser_ht'] = height_lookup[height_lookup$name == loser_name, ]$height
}
}
wta_df$winner_ht[is.na(wta_df$winner_ht)] <- round(mean(wta_df$winner_ht,na.rm = TRUE))
wta_df$loser_ht[is.na(wta_df$loser_ht)] <- round(mean(wta_df$loser_ht,na.rm = TRUE))
Previously I also discovered missingness in certain players’ heights: 63 unique players had missing height information. Since this information is easy to find, I created a lookup table of the missing players’ heights from Googling. There were only a handful of players whose height I was not able to find, so I imputed those with the average height.
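The same lookup-table imputation could also be done with a join instead of a row-by-row loop; this is a toy sketch with made-up names and heights, not the project's actual table.

```r
library(dplyr)

# Toy match table and height lookup (names and heights are made up)
matches <- tibble(winner_name = c("A", "B", "C"),
                  winner_ht   = c(180, NA, NA))
lookup  <- tibble(name = c("B", "C"), height = c(172, NA))

matches <- matches %>%
  left_join(lookup, by = c("winner_name" = "name")) %>%
  mutate(winner_ht = coalesce(winner_ht, height),      # fill from lookup
         winner_ht = if_else(is.na(winner_ht),         # fall back to the mean
                             round(mean(winner_ht, na.rm = TRUE)),
                             winner_ht)) %>%
  select(-height)
matches$winner_ht  # 180 172 176
```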
# rows with missingness have all the same match stats missing and are all from the same match
wta_df<- wta_df %>% drop_na(minutes, w_ace, winner_rank, loser_rank)
# dependent on ranking and statistics but independent of player
wta_df <- wta_df %>% dplyr::select(- c(winner_id, winner_name, loser_id, loser_name))
# how many NA values are remaining
num_0<- sum(is.na(wta_df)) #0
# 1760 rows and 33 columns
dim(wta_df)
## [1] 1760 33
The last bit of cleaning dealt with the match statistics missing from 135 rows; I simply dropped those rows, because match statistics are crucial for determining whether a player wins. I also dropped the variables winner_id, winner_name, loser_id, and loser_name, because I wanted the model to be independent of the actual player while still depending on the player's ranking and statistics.
# create 2 versions of dataset, one from the loser view, one from winner view (player vs opponent)
# by default we have the winner
colnames(wta_df)<- str_replace(colnames(wta_df), "winner", "player")
colnames(wta_df)<- str_replace(colnames(wta_df), "loser", "opp")
colnames(wta_df)<- str_replace(colnames(wta_df), "w_", "p_")
colnames(wta_df)<- str_replace(colnames(wta_df), "l_", "o_")
# splitting up the set score
for (i in 1:nrow(wta_df)){
num_sets = str_count(wta_df$score[i], "-")
strscore<- unlist(str_split(wta_df$score[i], " "))
strscore<- unlist(str_split(strscore, "-"))
new_strscore<- substr(strscore, 1, 1)[1:2] # first character drops tiebreak notation, e.g. "6(5)"
wta_df$p_set1[i]<- new_strscore[1]
wta_df$o_set1[i]<- new_strscore[2]
wta_df$total_sets[i]<- num_sets
wta_df$win[i]<- 1
}
# remove any match that was not completed
x<- as.numeric(wta_df$p_set1)
removeRow_set<- which(is.na(x))
wta_df<- wta_df[-removeRow_set,]
winner_df<- wta_df # dataframe from the winner POV
loser_df<- wta_df # dataframe from the loser POV
# swap the opponent and the player
colnames(loser_df)<- c("tourney_name", "surface","opp_hand", "opp_ht","opp_ioc",
"opp_age", "player_hand" , "player_ht" , "player_ioc", "player_age", "score",
"round", "minutes", "o_ace" , "o_df" ,"o_svpt", "o_1stIn" , "o_1stWon", "o_2ndWon",
"o_SvGms" , "o_bpSaved", "o_bpFaced", "p_ace", "p_df", "p_svpt", "p_1stIn",
"p_1stWon", "p_2ndWon", "p_SvGms" , "p_bpSaved","p_bpFaced","opp_rank", "player_rank", "o_set1","p_set1","total_sets", "win")
loser_df$win<- 0
loser_df2 <- loser_df[, match(colnames(loser_df), colnames(winner_df))] # now columns are in same order
total_wta_df<- rbind(winner_df, loser_df2)
total_wta_df <-total_wta_df %>% dplyr::select(- c(score))
total_wta_df <- total_wta_df %>% mutate_at(c('p_set1', 'o_set1'), as.numeric)
There are some variables we need to create and rename, and we need to slightly change how the information is represented in order to actually perform binary classification. In the data set we are using, each row represents a match from the winner's perspective: there is information about a winner and a loser, but this setup has only one outcome, winning. So we need to restructure the data set to be from a player's perspective rather than the winner's. To create this neutrality and a binary outcome, we renamed the columns containing 'winner' or 'w' to player (p) and the columns containing 'loser' or 'l' to opponent (opp). Then I created a new variable win and recorded the match outcome from the player's perspective (all 1s, since the information is presented from the winner's perspective). To get information on match losses, I simply duplicated the data set and swapped the column information so that the duplicate is from the loser's perspective. Combining the winner's-perspective and loser's-perspective data sets gives a final data set with an equal distribution of match outcomes and a binary outcome. This restructured format has 2 rows for each match, one from the winner's perspective and one from the loser's, so that we can predict match outcomes. I also engineered three variables: the number of games won in the first set by the winner and by the loser, derived from the score column that contains the full match score (e.g. 6-3, 4-6, 6-2), and the total sets played. Since you can't predict the outcome of a match using its final score, we removed the score variable and kept only the first-set information. I decided to keep the first-set score because I believed we could build a better predictive model by predicting a player's probability of winning after the first set has been played.
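The first-set extraction described above can be illustrated on a single score string; this simplified sketch uses the same stringr calls as the loop, applied to one hypothetical score.

```r
library(stringr)

# Derive total sets and first-set games from a match score string
score      <- "6-3 4-6 6-2"
total_sets <- str_count(score, "-")                     # 3 sets played
sets       <- str_split(str_split(score, " ")[[1]], "-")
first_set  <- as.numeric(sets[[1]])
p_set1 <- first_set[1]  # games won by the winner in set 1 -> 6
o_set1 <- first_set[2]  # games won by the loser in set 1  -> 3
```

A real score can include tiebreak notation such as 7-6(5), which is why the project code keeps only the first character of each game count.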
dummy_2L<- total_wta_df %>% mutate(opp_ioc = factor(opp_ioc),
player_ioc = factor(player_ioc),
player_age = base::round(player_age),
opp_age = base::round(opp_age)) %>%
mutate(round = recode(round, "R128" = 1, "R64" = 2, "R32" = 3, "R16" = 4, "QF" = 5, "SF" = 6, "F" = 7),
surface = recode(surface, "Hard" = 1, "Clay" = 2, "Grass" = 3 ))
testing_df <- dummy_2L %>% dplyr::select(where(is.numeric))
M = cor(testing_df)
corr_with_win<- as.data.frame(sort(M['win',], decreasing = T))
corrplot(M,type = 'lower')
# Takeaway: surface, round, minutes, and total sets have near-zero correlation with win, so we can remove them
total_wta_df_clean <- total_wta_df %>% dplyr::select(- c(surface, round, minutes, total_sets, player_hand, opp_hand)) %>%
mutate(opp_ioc = factor(opp_ioc), player_ioc = factor(player_ioc), win = factor(win),
player_age = base::round(player_age), opp_age = base::round(opp_age), tourney_name = factor(tourney_name))
total_wta_df_eda <- total_wta_df %>%
mutate(opp_ioc = factor(opp_ioc), player_ioc = factor(player_ioc), win = factor(win),
player_age = base::round(player_age), opp_age = base::round(opp_age), tourney_name = factor(tourney_name),
player_hand = factor(player_hand), opp_hand = factor(opp_hand))
save(total_wta_df_eda, file = 'data/eda_wta_df.RData')
save(total_wta_df_clean, file = 'data/clean_wta_df.RData')
Before finalizing my data set, I created one more correlation plot to identify insignificant features, temporarily converting the categorical variables surface and round to numeric. I found that surface, round, minutes, and total sets played have near-zero correlation with win, so I decided to remove those variables. Lastly, I converted player country, match outcome, tournament name, and player's hand to factors and rounded age to a whole number.
As demonstrated in our work above, we eliminated observations that were missing key statistics and dropped variables based on my familiarity with tennis and knowledge of which variables matter. I trimmed the data down to 32 key variables: 31 predictors plus win, our response variable. The variables selected for the clean data set are as follows. After running models with all these variables, I ultimately removed four of them in order to prevent rank deficiency and increase the rigor of prediction, which I will explain later on.
tourney_name: tournament name (Australian Open, Roland Garros, Wimbledon, US Open)
player_ht, opp_ht: player's/opponent's height (cm)
player_ioc, opp_ioc: three-character country code
player_age, opp_age: age, in years, at the time of the match
player_rank, opp_rank: player's/opponent's WTA rank as of the date of the match
p_df, o_df: player's/opponent's number of double faults
p_ace, o_ace: player's/opponent's number of aces
p_svpt, o_svpt: player's/opponent's number of service points
p_1stIn, o_1stIn: player's/opponent's number of first serves made
p_1stWon, o_1stWon: player's/opponent's number of first-serve points won
p_2ndWon, o_2ndWon: player's/opponent's number of second-serve points won
p_SvGms, o_SvGms: player's/opponent's number of serve games
p_bpSaved, o_bpSaved: player's/opponent's number of break points saved
p_bpFaced, o_bpFaced: player's/opponent's number of break points faced
p_set1, o_set1: player's/opponent's number of games won in set 1
win: match outcome from the player's perspective
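As a hedged sketch of how these variables feed a classification recipe, the toy rows and preprocessing steps below are placeholders (not the project's final recipe): win is declared as the response and everything else as a predictor.

```r
library(tidymodels)

# A few toy rows using columns from the list above
toy <- tibble(
  win          = factor(c(1, 0, 1, 0)),
  tourney_name = factor(c("US Open", "Wimbledon", "US Open", "Roland Garros")),
  player_ht    = c(180, 170, 175, 168),
  opp_ht       = c(172, 182, 169, 177),
  p_ace        = c(5, 1, 3, 0),
  o_ace        = c(2, 4, 1, 6)
)

# win is the response; categorical predictors are dummy-encoded
wta_recipe <- recipe(win ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prep(wta_recipe)
```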
With our data now clean and ready to use, I wanted to look at the effect some of our variables have on match outcomes using ggplot visualizations. Before getting started, though, I created some basic visualizations to understand the distributions of the variables we will be working with.
load('data/eda_wta_df.RData')
load('data/clean_wta_df.RData')
# Basic Distribution
ggplot(total_wta_df_eda, aes(x=player_age)) +
geom_histogram(bins = 20, fill = "darkolivegreen2") + xlab("Player Age")+
ggtitle("Histogram of Player Age (years)") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(total_wta_df_eda, aes(x=player_ht)) +
geom_histogram(bins = 20, fill = "darkolivegreen2") + xlab("Player Height")+
ggtitle("Histogram of Player Heights (cm)") +
theme(plot.title = element_text(hjust = 0.5))
In the histogram of player age, we see that most tennis players who compete at Grand Slams are between 20 and 30 years old. This makes sense because tennis is a highly physical sport and many players are at their physical peak at this age. The distribution is roughly normal and unimodal, centered around 25 years of age.
In the histogram of player height, we see that most tennis players who compete at Grand Slams are between 169 and 181 centimeters. The distribution is relatively normal but slightly skewed to the right. This is logical because taller players have greater reach and can hit the ball harder, so many who become professional tennis players already have these desired qualities.
ggplot(total_wta_df_eda, aes(x=minutes)) +
geom_histogram(bins = 25,alpha = 0.5, fill = "darkolivegreen2") +
xlab("Match Length (minutes)")+
ggtitle("Histogram of Match Lengths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(aes(xintercept = mean(minutes)),col='red',size=1)
ggplot(total_wta_df_eda, aes(x = tourney_name, y = minutes, color = tourney_name)) +
geom_violin(trim = FALSE) + xlab("Match Length (minutes)")+
ggtitle("Violin Plot of Match Length by Grand Slam") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "pointrange", color = "black")
ggplot(data=total_wta_df_eda, mapping = aes(x = win, y = minutes)) +
geom_boxplot(fill = "darkolivegreen2") +
ylab("Match Length (minutes)")+ xlab("Match Outcome") +
ggtitle("Boxplot of Match Duration by Outcome ")
Next, I wanted to analyze trends in match duration and see how it differs by Grand Slam, given the different surfaces. Looking at the histogram of match duration, most Grand Slam matches range from 50 to 200 minutes. The distribution is unimodal and skewed to the right, centered around 95 minutes, which tracks with the average WTA match lasting roughly 90 minutes.
Because the surface of a tennis court affects how the ball plays (some surfaces make the ball faster, others slower) and each location has its own conditions, I inspected match duration by Grand Slam. Looking at each distribution, all Grand Slams follow roughly the same shape, with some interesting differences. The US Open had the longest match durations of all the tournaments. The Grand Slams have approximately the same mean match length, but Wimbledon's average is slightly lower. This makes sense because grass courts are the fastest of all the surfaces, so points tend to be shorter.
Lastly, I observed no significant difference in match duration by outcome.
# Which countries have a certain affinity for surfaces?
country_count<- as.data.frame(table(total_wta_df_eda$player_ioc))
newdata <- country_count[order(-country_count$Freq),]
mydat1 <- total_wta_df_eda %>% dplyr::select(player_ioc,win,surface)
mydat1_val <- as.data.frame(mydat1 %>% group_by(player_ioc, surface) %>% count(win)) %>%
dplyr::filter(win == 1)
country_win_clay<- mydat1_val %>%dplyr::filter(surface == 'Clay') %>%
dplyr::select(-c(surface,win))
country_win_grass<- mydat1_val %>%dplyr::filter(surface == 'Grass') %>%
dplyr::select(-c(surface,win))
country_win_hard <- mydat1_val %>%dplyr::filter(surface == 'Hard') %>%
dplyr::select(-c(surface,win))
clay_win_graph <- ggplot(data=country_win_clay, aes(x=player_ioc, y=n)) +
geom_bar(stat="identity", fill = "coral2") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
xlab("Country") + ylab("Number of Wins") + ggtitle("Number of Wins on Clay by Country")
grass_win_graph <- ggplot(data=country_win_grass, aes(x=player_ioc, y=n)) +
geom_bar(stat="identity", fill = "darkolivegreen3") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
xlab("Country") + ylab("Number of Wins") + ggtitle("Number of Wins on Grass by Country")
hard_win_graph <- ggplot(data=country_win_hard, aes(x=player_ioc, y=n)) +
geom_bar(stat="identity", fill = "deepskyblue3") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
xlab("Country") + ylab("Number of Wins") + ggtitle("Number of Wins on Hard by Country")
clay_win_graph
grass_win_graph
hard_win_graph
Looking at these plots, we see that the country with the most wins on any surface is the US, followed by the Czech Republic. This is because there are 594 US and 253 Romanian player entries in the data set, which comprise a significant share of it. Using these graphs, I determined the top 10 performing countries for each surface.
Clay: USA CZE RUS ROU FRA POL ESP AUS UKR BLR
Grass: USA CZE ROU GER AUS GBR FRA BEL POL RUS
Hard: USA CZE RUS AUS BLR ROU CHN FRA UKR ESP
This information reflects the most common court types in each country and also the popularity of each Grand Slam there. For example, Great Britain is among the top 10 performing countries on grass. This makes sense because grass courts are most common in the UK, which is also home to Wimbledon, a tournament that always brings out home favorites. Clay is the most common surface in Russia and Spain, and this too is reflected in the bar plots.
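For reproducibility, the per-surface top lists could be computed rather than read off the bar charts; a toy sketch with made-up win counts (the project would apply this to e.g. country_win_clay with n = 10):

```r
library(dplyr)

# Toy win counts per country on one surface (made-up numbers)
country_wins <- tibble(
  player_ioc = c("USA", "CZE", "RUS", "ROU", "FRA", "ESP"),
  n          = c(120, 60, 45, 40, 30, 25)
)

# Countries ranked by wins on that surface
top_countries <- country_wins %>%
  slice_max(n, n = 3) %>%
  pull(player_ioc)
top_countries  # "USA" "CZE" "RUS"
```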
height_data <- as.data.frame(total_wta_df_eda %>% dplyr::select(player_ht, win) %>%
group_by(player_ht) %>% count(win)) %>% mutate(win = factor(win))
ggplot(height_data, aes(fill=win, y=n, x=player_ht)) +
geom_bar(position="stack", stat="identity") + xlab("Player Height") +
ylab("Frequency") +
ggtitle("Number of Match Outcomes by Player Height") +
theme(plot.title = element_text(hjust = 0.5))
Because we know that height is an important attribute of a tennis player's performance, I wanted to understand the distribution of match outcomes by height. Being taller physiologically allows a player to hit faster groundstrokes and more powerful serves. Looking at this plot, there didn't seem to be much of a difference between wins and losses overall. However, we can see that players who are 170 cm tall (about 5'7") are more likely to lose than win, while players who are 180 cm tall (about 5'11") are more likely to win than lose.
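Since some heights are far more common than others, the win rate per height is a fairer comparison than stacked counts; this toy sketch uses simulated outcomes, but the project data could be grouped the same way.

```r
library(dplyr)

# Toy outcomes by height (simulated)
set.seed(2)
df <- tibble(
  player_ht = sample(c(165, 170, 175, 180), 200, replace = TRUE),
  win       = rbinom(200, 1, 0.5)
)

# Win rate and sample size per height
win_rate_by_ht <- df %>%
  group_by(player_ht) %>%
  summarise(matches = n(), win_rate = mean(win))
win_rate_by_ht
```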
ggplot(data=total_wta_df_eda[1:1759,], aes(x=player_ht, y=p_ace)) +
geom_bar(stat="identity", fill = "darkolivegreen2") + xlab("Player Height") +
ylab("Number of Aces") +
ggtitle("Number of Aces by Player Height") +
theme(plot.title = element_text(hjust = 0.5))
Since being taller lets players generate more power on serve, I looked at the relationship between the number of aces and player height. While there isn't an obvious trend showing that taller players hit more aces, we do see a high number of aces from taller players. This is an interesting insight because a good serve is the best way to stay competitive in a match, since a player is expected to hold her own serve.
first_serve_plot<- ggplot(total_wta_df_eda, aes(x=win, y=p_1stWon)) +
geom_boxplot(fill = "darkolivegreen2") + xlab("Match Outcome") +
ylab("Number of First Serve Points Won") +
ggtitle("Distribution of First Serve Points Won by Match Outcome") +
theme(plot.title = element_text(hjust = 0.5))
second_serve_plot<- ggplot(total_wta_df_eda, aes(x=win, y=p_2ndWon)) +
geom_boxplot(fill = "darkolivegreen2") + xlab("Match Outcome") +
ylab("Number of Second Serve Points Won") +
ggtitle("Distribution of Second Serve Points Won by Match Outcome") +
theme(plot.title = element_text(hjust = 0.5))
first_serve_plot
second_serve_plot
ggplot(total_wta_df_eda, aes(x=p_1stWon, y=p_2ndWon, color = win)) +
geom_point(alpha = 0.3) + geom_smooth()
Because serves are such an important facet of a tennis match, I wanted to see whether serving performance relates to match outcome. Serving gives a player complete control from the start of a point; with a good serve, all you need to do is focus on breaking your opponent's serve. Looking at the first boxplot, players who win a larger number of their first-serve points are more likely to win, and the same can be said for second-serve points. Lastly, I plotted first-serve against second-serve points won to look for any relationship with match outcome. The dense cluster of red points in the lower-left corner shows the importance of holding serve in tennis matches. Both groups follow a roughly positive trend, but past a certain point the losing curve starts to dip while the winning curve keeps increasing.
ggplot(total_wta_df_eda, aes(x=p_1stIn, y=p_1stWon, color = win)) +
geom_jitter(alpha = 0.3) + xlab("Number of First Serves Made") +
ylab("Number of First Serves Points Won") +
ggtitle("Relationship Between First Serves Made and Won by Match Outcome")+
geom_smooth(method=lm, se=TRUE, aes(group=win)) +
theme(plot.title = element_text(hjust = 0.5))
Next, I examined the relationship between the number of first serves made and first-serve points won. Given the complete control a player has on serve and the raw power of the shot, I'm not surprised to see a positive relationship between these two variables. From the plot, there is a discernible difference by match outcome: players who win their matches win more first-serve points.
ranking_df<- (as.data.frame(total_wta_df_eda %>% dplyr::select(player_rank, opp_rank, win)) %>%
mutate(win = factor(win)))[1:1759,]
for (i in 1:(nrow(ranking_df))){
if(ranking_df$player_rank[i] < ranking_df$opp_rank[i]){
ranking_df$player_rank_better[i] = 1
}
else{
ranking_df$player_rank_better[i] = 0
}
if(ranking_df$player_rank_better[i] == 1 & ranking_df$win[i] == 1){
ranking_df$did_better_player_win[i] = 1
}
else{
ranking_df$did_better_player_win[i] = 0
}
}
ggplot(ranking_df, aes(factor(did_better_player_win))) +
geom_bar(fill = "darkolivegreen2") +
scale_x_discrete(labels=c("Lesser Ranked Player Won", "Better Ranked Player Won")) +
xlab("") +
ggtitle("Did the better ranked player win?") +
theme(plot.title = element_text(hjust = 0.5))
A tennis player's ranking reflects a lot about her current form and performance. A player's ranking is determined by the matches and tournaments she has won. As in other sports, the higher-ranked tennis player is generally the better player, so I wanted to look at how ranking plays a role in match outcomes. Here I created a new column indicating whether the better-ranked player won, and found that higher-ranked players are more likely to win a match against a lower-ranked counterpart. This makes sense because a highly ranked player is probably in better form and shape and has more momentum than a lower-ranked player.
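The row-by-row loop above can also be written in vectorized form; the toy rows here stand in for the real ranking columns (in WTA rankings, a lower number is better).

```r
library(dplyr)

# Vectorized version of the better-ranked-player flags (toy values)
ranking_df <- tibble(
  player_rank = c(3, 50, 12),
  opp_rank    = c(20, 4, 40),
  win         = c(1, 1, 0)
)

ranking_df <- ranking_df %>%
  mutate(player_rank_better    = as.integer(player_rank < opp_rank),
         did_better_player_win = as.integer(player_rank_better == 1 & win == 1))
ranking_df$did_better_player_win  # 1 0 0
```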
first_set_df<- (as.data.frame(total_wta_df_eda %>% dplyr::select(p_set1, o_set1, win)) %>%
mutate(win = factor(win)))[1:1759,]
for (i in 1:(nrow(first_set_df))){
if(first_set_df$p_set1[i] > first_set_df$o_set1[i]){
first_set_df$wonset1[i] = 1
}
else if (first_set_df$p_set1[i] < first_set_df$o_set1[i]){
first_set_df$wonset1[i] = 0
}
if(first_set_df$wonset1[i] == 1 & first_set_df$win[i] == 1){
first_set_df$win_overall_given_firstset[i] = 1
}
else{
first_set_df$win_overall_given_firstset[i] = 0
}
}
ggplot(first_set_df, aes(factor(win_overall_given_firstset))) +
geom_bar(fill = "darkolivegreen2") +
scale_x_discrete(labels=c("Lost First Set and Won Match", "Won First Set and Won Match")) +
xlab("") +
ggtitle("Frequency of First Set Outcomes and Winning a Match") +
theme(plot.title = element_text(hjust = 0.5))
Grand Slam tennis matches are played best of 3 sets, like all professional women's tennis matches, so a player needs to win 2 sets in order to win the match. Here, I wanted to examine how the first-set outcome influences the overall match outcome. The outcome of the first set greatly affects a player's mindset and psyche going into the second set, so it is not surprising to find that players are more likely to win the overall match if they win the first set.
ggplot(total_wta_df_eda, aes(x=p_1stIn, y=p_bpFaced, color = win)) +
geom_point(alpha = 0.2) + xlab("Number of First Serves Made Per Match") +
ylab("Number of Breakpoints Faced Per Match") +
ggtitle("First Serves Made and Breakpoints Faced by Match Outcome")+
geom_smooth(method=lm, se=TRUE, aes(group=win)) +
theme(plot.title = element_text(hjust = 0.5))
Because serving gives tennis players a big advantage when playing a point, a player is expected to win the games in which they serve. A break point is a point where the returner has a chance to win a game while their opponent is serving. The higher the number of break points a player faces, the more vulnerable they are to losing their serve, which sets them back in a match. So, I plotted the number of first serves made in a match against the number of break points faced. Here, we can see a cluster of blue points in the bottom half of the plot, which means that players who win their matches face fewer break points because of a more reliable and powerful first serve. This shows that a good serve is necessary to win a match and makes a player less vulnerable to break points.
ggplot(total_wta_df_eda, aes(p_bpFaced)) +
geom_bar(aes(fill = win)) + xlab("Number of Break Points Faced Per Match") +
ylab("Count") +
ggtitle("Number of Breakpoints Faced Per Match by Outcome")
ggplot(data=total_wta_df_eda, mapping = aes(x = win, y = p_bpFaced)) +
geom_boxplot(fill = "darkolivegreen2") +
ylab("Number of Breakpoints Faced Per Match")+ xlab("Match Outcome") +
ggtitle("Boxplot of Number of Breakpoints Faced Per Match by Outcome ")
If a player faces a high number of break points, that means they are not holding their serve very well and are vulnerable to losing the set; there is more pressure to save those break points in order to hold the service game. Here we can see that players who faced fewer break points are more likely to win because they are not at risk of losing their service games.
From the histogram we found that as the number of break points faced during a match increased, it became more likely that the player would lose the match. The boxplot reflects this as well.
ggplot(total_wta_df_eda, aes(p_bpSaved)) +
geom_bar(aes(fill = win)) + xlab("Number of Break Points Saved") +
ylab("Count") +
ggtitle(" Number of Breakpoints Saved Per Match by Outcome")
The number of break points saved refers to the number of times in the match the serving player faced a break point but won that point, thus “saving” it. As with the trend spotted for the number of break points faced, facing fewer break points means a higher likelihood of winning the match. So, it is unsurprising to see that when a player saves a smaller number of break points (because they faced fewer to begin with), they are more likely to win.
After performing the EDA, we have a good idea of how the variables in our data set impact the match outcome, so we can now perform our training/testing split and create the recipe and cross-validation folds for our prediction model.
To perform a testing/training split on the data, I decided to go with
the standard 70/30 because we don’t have too many data points to work
with, so I wanted a substantial number of observations both to train and
to test our model. We split the data to prevent overfitting and so that
we can actually determine how our model performs on unseen data. To
maintain reproducibility, we set a random seed, and since we are
classifying the match outcome, we stratified on the response variable
win. After splitting our data set, we ended up with 2462 observations in
the training set and 1056 observations in the testing set. We then
performed 10-fold cross-validation on the training set, again stratified
on the response variable. I performed this in a separate script and
loaded it here.
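The split and folds were built in a separate script, but the tidymodels calls look roughly like this; the sketch runs on a toy frame since the real cleaned data set (total_wta_df_clean) lives in Models.R:

```r
library(tidymodels)

set.seed(231)  # reproducibility

# Toy stand-in for the cleaned data set (total_wta_df_clean in the report)
toy_df <- tibble(
  p_bpFaced   = rpois(200, 6),
  player_rank = sample(1:150, 200, replace = TRUE),
  win         = factor(sample(0:1, 200, replace = TRUE))
)

# 70/30 split, stratified on the response so both sets keep the class balance
wta_split <- initial_split(toy_df, prop = 0.7, strata = win)
wta_train <- training(wta_split)
wta_test  <- testing(wta_split)

# 10-fold cross-validation on the training set, also stratified on win
wta_folds <- vfold_cv(wta_train, v = 10, strata = win)
```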
Through the data cleaning and EDA processes, we found that all the
variables in our cleaned data set except for the player’s and opponent’s
hand are relevant to our model. As such, we will be using the same model
conditions, response variable, and predictors, so we are only creating
one main recipe for all our models. In the recipe, we dummy encode all
categorical variables and center and scale all numerical variables.
However, we decided to drop the player’s and opponent’s country from the
model because there are 46 countries, which means dozens of extra
parameters would be added per variable once they are dummy encoded. I
originally kept country, but I was getting a rank-deficient model, so I
removed it. I also removed the number of games won by the player and the
opponent in the first set because I decided I wanted to predict the
outcome of the match, not the outcome of the match given the first set.
I know eliminating these variables will make the prediction more
challenging. To see the script for the recipe, cross-validation, and
splitting, refer to the script Models.R.
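A sketch of what that shared recipe looks like; the toy frame and the dropped column names (player_country, p_set1) are stand-ins here, and the exact predictor list is in Models.R:

```r
library(tidymodels)

set.seed(1)

# Toy frame standing in for the training split; player_country / p_set1 are
# assumed names for the dropped country and first-set columns
wta_train_toy <- tibble(
  tourney_name   = factor(rep(c("Us Open", "Wimbledon"), 50)),
  player_country = factor(rep(c("USA", "GBR"), 50)),
  p_set1         = sample(0:7, 100, replace = TRUE),
  p_bpFaced      = rpois(100, 6),
  win            = factor(sample(0:1, 100, replace = TRUE))
)

wta_recipe <- recipe(win ~ ., data = wta_train_toy) %>%
  step_rm(player_country, p_set1) %>%        # drop country + first-set games
  step_dummy(all_nominal_predictors()) %>%   # dummy encode remaining factors
  step_normalize(all_numeric_predictors())   # center and scale numerics

baked <- prep(wta_recipe) %>% bake(new_data = NULL)
```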
load('data/model_setup.RData')
We will be trying different machine learning techniques we learned in
class and will be using the same recipe for all models. Since the
binary classification outcomes are balanced, we used roc_auc and
accuracy as performance metrics. The ROC AUC metric shows the trade-off
between sensitivity and specificity of a model, and is a good metric
whether the outcomes are balanced or unbalanced (in this case we have
balanced outcomes). For every model, we followed the same general steps:
set up the workflow, add the new model, and add the recipe. Then,
depending on the model we are working with, we: set up the tuning
grid with the desired parameters and levels of tuning, select the most
accurate model from the tuning, fit the model to the workflow, and save
the results. The recipe below is what I used for all of my models, and I
saved my recipe, folds, and training/testing data in an RDA file for
convenience, which I load below as well. I’ll now load the RDA files
that contain the tuning information and workflow for each model. You can
see the exact code for the models in the script called
Models.R.
load('data/bt_tune.rda') # Boosted Tree
load('data/knn_tune.rda') # K- Nearest Neighbors
load('data/dec_tree_tune.rda') # decision tree
load('data/log_tune.rda') # Log
load('data/lasso_tune.rda') # lasso
load('data/DL_tune.rda') # discriminant linear
load('data/DQ_tune.rda') # discriminant quad
In our decision tree model, I tuned three different parameters:
cost_complexity: positive number for the cost/complexity parameter used by CART models
tree_depth: maximum depth of the tree
min_n: minimum number of data values needed to create another split.
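The tuning setup behind tune_tree can be sketched as follows; the grid ranges here are illustrative, and the real values live in Models.R:

```r
library(tidymodels)

# Decision tree spec with the three tuned parameters
tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth      = tune(),
  min_n           = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# 4 levels per parameter -> 4^3 = 64 candidate models
tree_grid <- grid_regular(
  cost_complexity(range = c(-3, -1)),  # range is on the log10 scale
  tree_depth(range = c(1, 8)),
  min_n(range = c(2, 20)),
  levels = 4
)

# Tuning itself needs the recipe and folds from Models.R:
# tune_tree <- tune_grid(
#   workflow() %>% add_model(tree_spec) %>% add_recipe(wta_recipe),
#   resamples = wta_folds, grid = tree_grid
# )
```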
To visualize the effect that changes in certain parameters have on roc_auc and accuracy, I used the autoplot() function. I also extracted the best-performing model according to roc_auc and fit the final model on the whole training data set.
best_tree_roc <- select_best(tune_tree, metric = "roc_auc")
dec_tree_final <- finalize_workflow(wta_workflow_tree, best_tree_roc)
dec_tree_final_fit <- fit(dec_tree_final, data = wta_train)
autoplot(tune_tree, metric = 'roc_auc')
For the roc_auc metric, we can see that a tree depth of 8 and a larger minimal node size resulted in slightly better-performing models. Additionally, larger cost-complexity penalties decreased the performance of the decision tree. I combined this selected model with my decision tree workflow on the training data set to get the best-fit decision tree model, and I used rpart.plot to draw a diagram of my tree. Looking at the plot, we can see that the variable that matters most for the outcome is the number of break points faced.
dec_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint=FALSE)
As mentioned in the EDA, a break point is a very significant moment in a tennis match, letting the returning player gain an advantage or come back if they are down. The number of break points faced reflects the strength of one player’s returns and the weakness of the other player’s serves.
From the plot below, we observe that for K-Nearest Neighbors, model performance improved as the number of nearest neighbors increased. The highest roc_auc was a little over 0.90, so this model already performs worse than our decision tree.
best_knn_roc <- select_best(knn_tune, metric = "roc_auc")
knn_final <- finalize_workflow(knn_workflow, best_knn_roc)
knn_final_fit <- fit(knn_final, data = wta_train)
autoplot(knn_tune, metric = 'roc_auc')
In our boosted tree model, I tuned three different parameters:
learn_rate: number for the rate at which the boosting algorithm adapts from iteration-to-iteration
mtry: number of predictors that will be randomly sampled at each split when creating the tree models
min_n: minimum number of data values needed to create another split.
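A sketch of the boosted tree specification (parsnip with the xgboost engine loaded at the top of the report); the commented grid ranges are illustrative, not the exact ones from Models.R:

```r
library(tidymodels)

# Boosted tree spec with the three tuned parameters
bt_spec <- boost_tree(
  mtry       = tune(),  # predictors sampled at each split
  min_n      = tune(),  # minimum node size
  learn_rate = tune()   # shrinkage between boosting iterations
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# As with the other models, tuning needs the shared recipe and folds:
# bt_tune <- tune_grid(
#   workflow() %>% add_model(bt_spec) %>% add_recipe(wta_recipe),
#   resamples = wta_folds,
#   grid = grid_regular(mtry(range = c(1, 13)), min_n(range = c(2, 20)),
#                       learn_rate(range = c(-2, 0.2)), levels = 5)
# )
```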
best_bt_roc <- select_best(bt_tune, metric = "roc_auc")
bt_final <- finalize_workflow(bt_workflow, best_bt_roc)
bt_final_fit <- fit(bt_final, data = wta_train)
autoplot(bt_tune, metric = 'roc_auc')
In the plot, we can see that a higher learning rate with a small to medium minimal node size and a higher number of predictors results in the highest roc_auc value. The ultimate optimal model had 13 randomly selected predictors, a minimum node size of 2, and a learning rate of approximately 1.6.
In the plot for lasso regression, we can see that a small amount of regularization and a high proportion of lasso penalty results in the highest roc_auc value. Even though there are 5 lines plotted, we can only see the lasso penalties of 0 and 1, which makes me think the penalties in between perform almost exactly the same as the two lines present. The lasso penalties of 0 and 1 differ very slightly in roc_auc as the amount of regularization increases, but a penalty of 1 makes the roc_auc drop off significantly more steeply around 0.001.
# LASSO
best_lasso_roc <- select_best(tune_lasso, metric = "roc_auc")
lasso_final <- finalize_workflow( wta_workflow_lasso, best_lasso_roc)
lasso_final_fit <- fit(lasso_final, data = wta_train)
autoplot(tune_lasso, metric = 'roc_auc')
Lasso regression performs variable selection and regularization to improve accuracy and interpretability, and can thus prevent a model from overfitting. If the penalty is very small, the model keeps all of the variables and there is essentially no regularization; if the penalty is too large, the lasso selects none of the variables, which is too extreme. We can see that a lasso penalty (mixture) between 0.25 and 1 yields the highest roc_auc value with a small to medium amount of regularization. This means there are extra parameters in the model that can be set to zero.
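The lasso (strictly, elastic net) specification being tuned can be sketched as follows, with penalty controlling the amount of regularization and mixture the proportion of lasso (1) versus ridge (0) penalty; the ranges are illustrative:

```r
library(tidymodels)

lasso_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# 5 mixture levels -> the 5 lines in the autoplot above
lasso_grid <- grid_regular(
  penalty(range = c(-5, 0)),  # log10 scale: 1e-5 .. 1
  mixture(range = c(0, 1)),
  levels = 5
)
```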
In order to determine the best ROC AUC score for each model, I extracted the roc_auc values from each model (using the best candidate from each tuned technique) and created a data frame to present all the information at once.
# LOG
log_final_fit <- fit(wta_workflow_log, data = wta_train)
wta_log_reg_auc <- augment(log_final_fit, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
#DL
DL_train_fit_results <- fit(wta_workflow_DL, wta_train)
wta_DL_auc <- augment(DL_train_fit_results, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
# DQ
DQ_train_fit_results <- fit(wta_workflow_DQ, wta_train)
wta_DQ_auc <- augment(DQ_train_fit_results, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
#Decision Tree
wta_DT_auc <- augment(dec_tree_final_fit, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
# KNN
wta_knn_auc <- augment(knn_final_fit, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
# Boosted Tree
wta_bt_auc <- augment(bt_final_fit, new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
# Lasso Tree
wta_lasso_auc <- augment(lasso_final_fit , new_data = wta_train) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
roc_auc_estimate<- c(wta_log_reg_auc$.estimate,
wta_DL_auc$.estimate,
wta_DQ_auc$.estimate,
wta_DT_auc$.estimate,
wta_knn_auc$.estimate,
wta_bt_auc$.estimate,
wta_lasso_auc$.estimate)
model_names<- c("Logistic Regression", "LDA","QDA","Decision Tree", "K-Nearest Neighbor", "Boosted Tree", "Lasso")
wta_model_results <- data.frame(Model = model_names,
ROC_AUC = roc_auc_estimate)
wta_model_results <- wta_model_results %>%
arrange(desc(ROC_AUC))
wta_model_results
## Model ROC_AUC
## 1 Boosted Tree 1.0000000
## 2 Logistic Regression 0.9958657
## 3 Lasso 0.9958472
## 4 LDA 0.9956340
## 5 QDA 0.9919148
## 6 K-Nearest Neighbor 0.9862468
## 7 Decision Tree 0.9512242
Looking at the table above, we can see that all the models performed extremely well, with roc_auc values above 0.95. Because performance is so high across the board, I’m very curious to see how the models do on the test set. The boosted tree model obtained a roc_auc of 1, the best a model can get, but I have a feeling that could be due to overfitting. I also produced a dot plot and a lollipop plot to illustrate each method’s performance. From both graphs we can see that the boosted tree model and the logistic regression performed the best.
wta_results_dot_plot <- ggplot(wta_model_results, aes(x = Model, y = ROC_AUC)) +
geom_point(fill = "darkolivegreen2", color = "darkolivegreen2", size=7) +
geom_segment(aes(x = Model,
xend = Model,
y=min(ROC_AUC),
yend = max(ROC_AUC)),
linetype = "longdash",
size=0.5) +
labs(title = "Performance of Tennis Match Predicting Models") +
coord_flip() + theme(plot.title = element_text(hjust = 0.5))
wta_results_dot_plot
wta_lollipop_plot <- ggplot(wta_model_results, aes(x = Model, y = ROC_AUC)) +
geom_segment(aes(x = Model, xend = Model, y = ROC_AUC, yend = 0)) +
geom_point(size=6, color= "black", fill=alpha("darkolivegreen2", 0.4), alpha=0.7, shape=21, stroke=3)+
labs(title = "Performance of Tennis Match Predicting Models") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
wta_lollipop_plot
From the graphs and table of the roc_auc metric for each model, we saw
that all the models performed extremely well. However, the model that
performed the best was the boosted tree. Since we tuned this model, let
us look at its exact parameters and see which tree is best. Using the
show_best function, we can determine which values of mtry, min_n, and
learn_rate yielded the most accurate model. We found that model 20 out
of 100 performed the best for the boosted tree. The minimum number of
data points in a node required for a further split (min_n) was 2, 13
predictors were randomly sampled at each split when creating the tree
models (mtry), and the learning rate was 1.58.
show_best(bt_tune, metric = "roc_auc") %>% #showing the best rf model
dplyr::select(-.estimator, .config) %>%
dplyr::slice(1)
## # A tibble: 1 × 8
## mtry min_n learn_rate .metric mean n std_err .config
## <int> <int> <dbl> <chr> <dbl> <int> <dbl> <chr>
## 1 13 2 1.58 roc_auc 0.968 10 0.00161 Preprocessor1_Model20
wta_predict <- predict(bt_final_fit, new_data = wta_test, type = "class")
wta_predict_with_actual_datal <- wta_predict %>%
bind_cols(wta_test) # adding the actual values side by side to our predicted values
head(wta_predict_with_actual_datal,10)
## # A tibble: 10 × 27
## .pred_class tourney_name player_ht player_age opp_ht opp_age p_ace p_df
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 1 Australian Open 175 25 184 29 4 2
## 2 1 Australian Open 180 24 162 20 4 3
## 3 1 Australian Open 175 27 175 27 3 5
## 4 1 Australian Open 170 28 179 25 5 3
## 5 1 Australian Open 170 24 181 31 2 3
## 6 1 Australian Open 180 24 175 24 4 2
## 7 1 Australian Open 173 28 175 24 0 1
## 8 1 Australian Open 174 26 174 26 3 4
## 9 1 Australian Open 179 26 172 37 11 5
## 10 1 Australian Open 184 22 172 28 9 5
## # … with 19 more variables: p_svpt <int>, p_1stIn <int>, p_1stWon <int>,
## # p_2ndWon <int>, p_SvGms <int>, p_bpSaved <int>, p_bpFaced <int>,
## # o_ace <int>, o_df <int>, o_svpt <int>, o_1stIn <int>, o_1stWon <int>,
## # o_2ndWon <int>, o_SvGms <int>, o_bpSaved <int>, o_bpFaced <int>,
## # player_rank <int>, opp_rank <int>, win <fct>
After using the best boosted tree model on our testing set, let us plot the ROC curve and obtain the final roc_auc value to see how our model actually performed.
wta_roc_curve <- augment(bt_final_fit, new_data = wta_test) %>%
roc_curve(win, estimate = .pred_0) # computing the ROC curve for this model
autoplot(wta_roc_curve)
As mentioned earlier, the more the ROC curve resembles an upside-down L hugging the top-left corner, the better the model performs. Our ROC curve looks very close to optimal, so this is promising; the curve appears to reach the top-left corner, which is ideal. Now, let’s take a look at the actual roc_auc metric of boosted tree model 20.
wta_best_test_roc_auc <- augment(bt_final_fit, new_data = wta_test) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate)
wta_best_test_roc_auc
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.961
We obtained a roc_auc value of 0.96. Typically, any roc_auc value above 0.9 is considered great for quantifying a model’s performance. This is amazing!
Since our model performed extremely well with regard to its roc_auc metric, I wanted to see how it performed for each specific Grand Slam.
grand_slam_names <- unlist(levels(wta_test$tourney_name))
wta_best_augmented <- augment(bt_final_fit, new_data = wta_test)
grand_slam_roc_auc <- function(gs){
estimate_column <- (wta_best_augmented %>%
dplyr::filter(str_detect(tourney_name, gs)) %>%
roc_auc(win, estimate = .pred_0) %>%
dplyr::select(.estimate))
estimate <- estimate_column$.estimate
(estimate)
}
grand_slam_roc_auc_scores <- vector("integer", 0)
for(i in grand_slam_names ){
grand_slam_roc_auc_scores <- c(grand_slam_roc_auc_scores, grand_slam_roc_auc(i))
}
logos= c("<img src='/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/Australian_Open_logo.png' width='50' /><br>*Australian Open*",
"<img src='/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/French_Open_Logo.png' width='50' /><br>*Roland Garros*",
"<img src='/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/US_Open_logo.png' width='50' /><br>*US Open*",
"<img src='/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/Wimbledon_logo.png' width='50' /><br>*Wimbledon*")
roc_auc_scores_by_grand_slam<- data.frame(Grand_Slam = grand_slam_names,
ROC_AUC = grand_slam_roc_auc_scores,
image = logos)
roc_auc_scores_by_grand_slam[,1:2]
## Grand_Slam ROC_AUC
## 1 Australian Open 0.9511899
## 2 Roland Garros 0.9628084
## 3 Us Open 0.9712987
## 4 Wimbledon 0.9581491
Looking at this table, it’s clear that the US Open and Roland Garros (the French Open) have the highest roc_auc values. Let’s plot these relationships with a bar plot.
ggplot(roc_auc_scores_by_grand_slam, aes(x = Grand_Slam, y = ROC_AUC)) +
geom_col(fill = "darkolivegreen2") + ylab("ROC AUC") +
ggtitle("ROC_AUC by Grand Slam") + theme(plot.title = element_text(hjust = 0.5)) +
scale_x_discrete(
name = "Grand Slam",
labels = logos
) +
theme(
axis.text.x = element_markdown(color = "black", size = 11)
)
Generally, the more data points you have, the better a model performs. In this case, I was curious to see the relationship between the number of data points for each Grand Slam and its roc_auc value. In our test data set, the Grand Slam with the fewest records is Roland Garros, followed by Wimbledon; the US Open and the Australian Open are almost tied for the most records. This is interesting because the US Open is the highest-performing Grand Slam while Wimbledon is the second lowest.
wta_win_total <- wta_test %>%
group_by(tourney_name) %>%
summarize(n = n())
names(wta_win_total)[2] <- "Frequency"
wta_win_grand_slam_amt <- total_wta_df_clean %>%
group_by(tourney_name, win) %>%
summarize(n = n())
wta_win_total<- cbind(wta_win_total,roc_auc_scores_by_grand_slam[,"ROC_AUC"])
names(wta_win_total)[3] <- c("ROC_AUC")
wta_win_total$logos<- c("/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/Australian_Open_logo.png",
"/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/French_Open_Logo.png",
"/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/US_Open_logo.png",
"/Users/jaiuparkar/Downloads/PSTAT231/Final_Project/Media/Wimbledon_logo.png")
wta_win_total[1:3]
## tourney_name Frequency ROC_AUC
## 1 Australian Open 309 0.9511899
## 2 Roland Garros 210 0.9628084
## 3 Us Open 304 0.9712987
## 4 Wimbledon 233 0.9581491
I plotted this to get a better representation of the relationship between the number of data points and model performance, and it gives us some interesting results. The Grand Slam that performs the best has one of the highest numbers of data points, while the one that performs second best has the smallest number of data points. There isn’t a general trend we can point out, since each Grand Slam sits in a different quadrant, which is interesting to note. Perhaps there isn’t a significant relationship between the number of data points per Grand Slam and performance.
ggplot(wta_win_total, aes(Frequency, ROC_AUC)) + geom_image(aes(image=logos), size=.1) +
ggtitle("Number of Data Points per Grand Slam and Model's ROC AUC") +
theme(plot.title = element_text(hjust = 0.5)) +xlim(200, 330) + ylim(0.93,0.975) +
geom_hline(yintercept=mean(wta_win_total$ROC_AUC), linetype="dashed",
color = "red", size=0.5) +
geom_vline(xintercept=mean(wta_win_total$Frequency), linetype="dashed",
color = "red", size=0.5)
Now that we’ve investigated how the model differs across the Grand Slams, let’s take a look at the most important variables in our boosted tree model and see how they match up with our EDA and background knowledge of the sport. From the variable importance plot, we see that the number of break points faced by the opponent and by the player are the most important variables. This makes sense because the number of break points faced reflects how well a player is holding their serve, which is the biggest way to stay competitive in a match and to gain an advantage (by breaking the opponent’s serve). In our EDA, we discovered that the fewer break points a player faces, the more likely they are to win the match.
bt_final_fit %>%
extract_fit_engine() %>%
vip(aesthetics = list(fill = "darkolivegreen2", color = "black"))
After creating a successful model to predict WTA Grand Slam tennis matches, let us actually apply this model. WTA is often very unpredictable and so are the outcomes of these Grand Slam tournaments. Let’s see how our model will perform with some of the most memorable and unexpected match outcomes in recent years.
# Extract her US Open 2021 wins
US_Open_2021 <- wta_2021_df %>% dplyr::filter(winner_name == "Emma Raducanu", tourney_name == "Us Open")
# Store information from each match
US_Open_R1 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 170,
opp_age = 31,
p_ace = 3,
p_df = 3,
p_svpt = 57,
p_1stIn = 44,
p_1stWon = 29,
p_2ndWon = 6,
p_SvGms = 9,
p_bpSaved = 1,
p_bpFaced = 3,
o_ace = 2,
o_df = 1,
o_svpt = 56,
o_1stIn = 37,
o_1stWon = 18,
o_2ndWon = 6,
o_SvGms = 8,
o_bpSaved = 3,
o_bpFaced = 8,
player_rank = 150,
opp_rank = 128)
US_Open_R2 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 177,
opp_age = 32,
p_ace = 3,
p_df = 0,
p_svpt = 46,
p_1stIn = 33,
p_1stWon = 28,
p_2ndWon = 7,
p_SvGms = 9,
p_bpSaved = 0,
p_bpFaced = 1,
o_ace = 2,
o_df = 5,
o_svpt = 87,
o_1stIn = 60,
o_1stWon = 32,
o_2ndWon = 12,
o_SvGms = 9,
o_bpSaved = 7,
o_bpFaced = 11,
player_rank = 150 ,
opp_rank = 49)
US_Open_R3 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 176,
opp_age = 25,
p_ace = 0,
p_df = 3,
p_svpt = 39,
p_1stIn = 28,
p_1stWon = 22,
p_2ndWon = 6,
p_SvGms = 7,
p_bpSaved = 0,
p_bpFaced = 0,
o_ace = 0,
o_df = 0,
o_svpt = 54,
o_1stIn = 40,
o_1stWon = 18,
o_2ndWon = 4,
o_SvGms = 6,
o_bpSaved = 6,
o_bpFaced = 11,
player_rank = 150,
opp_rank = 41)
US_Open_R4 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 175,
opp_age = 29,
p_ace = 3,
p_df = 2,
p_svpt = 66,
p_1stIn = 50,
p_1stWon = 32,
p_2ndWon = 8,
p_SvGms = 8,
p_bpSaved = 8,
p_bpFaced = 9,
o_ace = 0,
o_df = 1,
o_svpt = 41,
o_1stIn = 32,
o_1stWon = 12,
o_2ndWon = 4,
o_SvGms = 7,
o_bpSaved = 4,
o_bpFaced = 9,
player_rank = 150,
opp_rank = 43)
US_Open_R5 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 175,
opp_age = 24,
p_ace = 6,
p_df = 2,
p_svpt = 60,
p_1stIn = 39,
p_1stWon = 27,
p_2ndWon = 12,
p_SvGms = 10,
p_bpSaved = 4,
p_bpFaced = 5,
o_ace = 1,
o_df = 5,
o_svpt = 54,
o_1stIn = 33,
o_1stWon = 23,
o_2ndWon = 9,
o_SvGms = 9,
o_bpSaved = 3,
o_bpFaced = 6,
player_rank = 150,
opp_rank = 12)
US_Open_R6 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 172,
opp_age = 26,
p_ace = 4,
p_df = 2,
p_svpt = 56,
p_1stIn = 40,
p_1stWon = 39,
p_2ndWon = 11,
p_SvGms = 9,
p_bpSaved = 7,
p_bpFaced = 7,
o_ace = 4,
o_df = 5,
o_svpt = 60,
o_1stIn = 31,
o_1stWon = 22,
o_2ndWon = 10,
o_SvGms = 8,
o_bpSaved = 8,
o_bpFaced = 11,
player_rank = 150,
opp_rank = 18)
US_Open_R7 <- data.frame(
tourney_name = "Us Open",
player_ht = 175,
player_age = 19,
opp_ht = 168,
opp_age = 19,
p_ace = 3,
p_df = 2,
p_svpt = 71, # start after here
p_1stIn = 49,
p_1stWon = 33,
p_2ndWon = 10,
p_SvGms = 10,
p_bpSaved = 7,
p_bpFaced = 9,
o_ace = 2,
o_df = 5,
o_svpt = 78,
o_1stIn = 45,
o_1stWon = 25,
o_2ndWon = 15,
o_SvGms = 9,
o_bpSaved = 14,
o_bpFaced = 18,
player_rank = 150,
opp_rank = 73)
After saving each match that Emma Raducanu played in the 2021 US Open, along with its predictor information, in a data frame, I predicted each match outcome with our best model (the boosted tree). I then created a data frame that holds the prediction for each round of the Grand Slam.
R1_predit <- as.vector(predict(bt_final_fit, US_Open_R1, type = "class")$.pred_class)
R2_predit <- as.vector(predict(bt_final_fit, US_Open_R2, type = "class")$.pred_class)
R3_predit <- as.vector(predict(bt_final_fit, US_Open_R3, type = "class")$.pred_class)
R4_predit <- as.vector(predict(bt_final_fit, US_Open_R4, type = "class")$.pred_class)
R5_predit <- as.vector(predict(bt_final_fit, US_Open_R5, type = "class")$.pred_class)
R6_predit <- as.vector(predict(bt_final_fit, US_Open_R6, type = "class")$.pred_class)
R7_predit <- as.vector(predict(bt_final_fit, US_Open_R7, type = "class")$.pred_class)
Round<- c("R1", "R2","R3","R4","R5","R6","R7")
Prediction<- c(R1_predit, R2_predit, R3_predit, R4_predit, R5_predit, R6_predit, R7_predit)
US_Open_2021_model_prediction<- data.frame(Round = Round, Prediction= Prediction)
US_Open_2021_model_prediction
## Round Prediction
## 1 R1 1
## 2 R2 1
## 3 R3 1
## 4 R4 1
## 5 R5 1
## 6 R6 1
## 7 R7 1
Surprisingly, our model predicted that she would win all 7 of her matches and take the tournament, which is exactly what happened in real life! It was the sports story of the year: as a teenager and a qualifier, no one expected Emma Raducanu to win the 2021 US Open. It’s pretty astonishing that our model was able to reproduce her miraculous run. To be precise, the model did not predict her winning the tournament as a whole, but rather her winning each of the matchups in her draw.
In 2022, American Amanda Anisimova eliminated four-time Grand Slam winner Naomi Osaka from the French Open in the first round, in straight sets (but a close match). Most of the 2022 French Open data had missing information, so I am interested to see how our model will predict a match without any of its statistics. In our cleaning, we removed any observation that had missing values, so our model has never seen data like this before. It is necessary to note that Naomi Osaka had been performing inconsistently for the past couple of years due to mental health reasons, despite enormous past success.
Osaka_2020 <- wta_2022_df %>% dplyr::filter(tourney_name == "Roland Garros",
loser_name == "Naomi Osaka")
Oaska_R1_FR <- data.frame(
tourney_name = "Roland Garros",
player_ht = 180,
player_age = 25,
opp_ht = 180,
opp_age = 21,
p_ace = NA,
p_df = NA,
p_svpt = NA, # start after here
p_1stIn = NA,
p_1stWon = NA,
p_2ndWon = NA,
p_SvGms = NA,
p_bpSaved = NA,
p_bpFaced = NA,
o_ace = NA,
o_df = NA,
o_svpt = NA,
o_1stIn = NA,
o_1stWon = NA,
o_2ndWon = NA,
o_SvGms = NA,
o_bpSaved = NA,
o_bpFaced = NA,
player_rank = 38,
opp_rank = 28)
Oaska_R1_FR_predict <-predict(bt_final_fit, Oaska_R1_FR, type = "class")
Oaska_R1_FR_predict
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 1
The question we asked in this prediction is: Will Anisimova beat Osaka?
Despite all the missing information about the match, our model correctly predicted that Osaka would lose the match. The match was incredibly close, with Anisimova winning 7-5, 6-4 in the end. It’s always interesting to see how great tennis players perform despite not having the best performance in recent months.
2015 was an incredible year for Serena Williams. She was, arguably, in the best form of her life and had won all 3 Grand Slams leading up to the US Open in September; however, she lost there in the semi-finals. Coming into the 2016 season she was still in amazing form and made the final of the 2016 Australian Open against Angelique Kerber. Going into the final, Kerber and Williams had faced each other six times, with Williams holding a 5–1 advantage. Let’s see what our model predicts!
Williams_2016_AO <- data.frame(
tourney_name = "Us Open",
player_ht = 173,
player_age = 28,
opp_ht = 175,
opp_age = 34,
p_ace = 5,
p_df = 3,
p_svpt = 80,
p_1stIn = 44,
p_1stWon = 32,
p_2ndWon = 17,
p_SvGms = 15,
p_bpSaved = 4,
p_bpFaced = 8,
o_ace = 7,
o_df = 6,
o_svpt = 96,
o_1stIn = 51,
o_1stWon = 35,
o_2ndWon = 19,
o_SvGms = 14,
o_bpSaved = 4,
o_bpFaced = 9,
player_rank = 6,
opp_rank = 1)
Williams_2016_predict <-predict(bt_final_fit, Williams_2016_AO, type = "class")
Williams_2016_predict
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 1
The question we asked in this prediction is: Will Kerber beat Williams?
Our model predicted that Angelique Kerber would win, which is what happened in real life. It was a close match, with Kerber winning 6–4, 3–6, 6–4. Coming into the final, Williams had steamrolled the competition and hadn’t dropped a set. It’s important to note that our model doesn’t take into account a player’s recent performance, such as win streaks, history at the tournament, or head-to-heads with the opponent; the information fed into the model doesn’t reflect Williams’s historical performance during 2015.
Our boosted tree model has a roc_auc value of 0.96 on the testing data and 1 on the training data, and it correctly predicted all the matches I was interested in. Just to double-check that my model can actually make mistakes, let me check the confusion matrix.
Looking at the confusion matrix, we can see that the model predicted 535 negatives while the actual number of negatives is 528, and predicted 521 positives while the actual number of positives is 528. We also see that our model is slightly more likely to predict that a player loses a match than wins it.
Using these numbers we know that: \[accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{475 + 468}{475 + 468 +53+60} = 0.893\] \[precision = \frac{TP}{TP +FP} = \frac{468}{468 + 53} = 0.898\]
This means that 89.3% of our model’s predictions were correct and that our model did make mistakes.
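As a sanity check, the same metrics can be recomputed directly from the four confusion-matrix cells used in the formulas above:

```r
# Confusion-matrix cells taken from the formulas above
TP <- 468; TN <- 475; FP <- 53; FN <- 60

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)

round(accuracy, 3)   # 0.893
round(precision, 3)  # 0.898
```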
That’s a relief to see; perhaps our model is just that good! Time to celebrate!
Throughout this project, we developed a model that could predict the outcome of a WTA Grand Slam match extremely well, though it was not perfect.
One thing to keep in mind is that our model uses match statistics and player information to decide the outcome of a match. However, the match statistics can implicitly tell us who is going to win a tennis match based on how many first serves, breakpoints, and what ranking a player has; they already somewhat reflect the outcome of the match if you take a closer look. This was the best tennis data set I could find, and it is the most popular data set used for research in tennis. In a true prediction setting you would not have the match statistics, but this is the best we could do. This is most likely the reason why the accuracy and roc_auc values of our model are so high and why it is able to predict the outcomes of matches extremely well. I think it could be extremely interesting to integrate statistics about a player’s head-to-head record with another player, their winning streak, or even how many tournaments they have won in the past year. I want to create prediction models that do not use match statistics, because I believe they make the problem too easy. Further research could figure out a way to integrate these statistics into the model and to predict more than just Grand Slam matches.
I dropped some variables that I spent time cleaning and analyzing in my EDA, such as the number of games each player won in the first set and their countries. From running my models, I found that including the number of games won during the first set made predicting the outcome too easy, since most of the time a positive result in the first set yielded a positive match outcome. Models including these metrics produced extremely high roc_auc values, so I decided to drop them to make the conditions a little more challenging. I also considered including both the player’s and opponent’s country, but since there are 46 levels for each of those variables, dummy encoding them would introduce 90 more predictors (45 per variable) into my model, which made certain models rank deficient.
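For reference, this is roughly how the country columns would have been dummy-encoded with recipes. The column names (`player_ioc`, `opp_ioc`) and the training tibble `wta_train` are stand-ins, not the project’s actual identifiers:

```r
library(recipes)

# Dummy-encoding two 46-level factors adds 2 * (46 - 1) = 90 columns,
# since step_dummy() drops one reference level per factor.
# (Column and tibble names here are hypothetical.)
country_rec <- recipe(result ~ ., data = wta_train) %>%
  step_dummy(player_ioc, opp_ioc)
```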
From testing various models and discovering that all of them performed extremely well, I ultimately went with the boosted tree model because of the near-perfect roc_auc value it had on the testing data set. The boosted tree model is naturally robust to missing values and outliers, which lends some insight into how our model was able to correctly predict a match outcome despite missing match statistics. Since boosted trees combine two techniques, decision tree algorithms and boosting methods, each new tree is built to correct the errors of the trees before it. In this sense, the method continuously improves and builds upon itself in an iterative manner.
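A minimal boosted-tree specification in tidymodels looks like the sketch below; the hyperparameter values shown are illustrative placeholders, not the tuned values used in the project:

```r
library(parsnip)

# Illustrative boosted-tree spec backed by xgboost.
# trees / tree_depth / learn_rate here are placeholder values,
# not the project's tuned hyperparameters.
bt_spec <- boost_tree(trees = 1000, tree_depth = 4, learn_rate = 0.1) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```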
Moving forward, I would like to fit some more of the models we learned later in the class and understand which types of records my model is misclassifying. Additionally, I enjoyed making visualizations of the data, and I wish I had explored that aspect of the project more. Overall, this project was a great way for me to apply the content I learned this quarter, and it has made me more confident in my skills as a data scientist. I am really proud of how far I have come, and this project has inspired me to pursue more independent machine learning projects.